10 research outputs found

    Exploiting Pretrained Biochemical Language Models for Targeted Drug Design

    Motivation: The development of novel compounds targeting proteins of interest is one of the most important tasks in the pharmaceutical industry. Deep generative models have been applied to targeted molecular design and have shown promising results. Recently, target-specific molecule generation has been viewed as a translation between the protein language and the chemical language. However, such a model is limited by the availability of interacting protein-ligand pairs. On the other hand, large amounts of unlabeled protein sequences and chemical compounds are available and have been used to train language models that learn useful representations. In this study, we propose exploiting pretrained biochemical language models to initialize (i.e. warm start) targeted molecule generation models. We investigate two warm-start strategies: (i) a one-stage strategy in which the initialized model is trained on targeted molecule generation, and (ii) a two-stage strategy consisting of pre-finetuning on molecular generation followed by target-specific training. We also compare two decoding strategies to generate compounds: beam search and sampling. Results: The results show that the warm-started models perform better than a baseline model trained from scratch. The two proposed warm-start strategies achieve similar results to each other with respect to widely used metrics from benchmarks. However, docking evaluation of the generated compounds for a number of novel proteins suggests that the one-stage strategy generalizes better than the two-stage strategy. Additionally, we observe that beam search outperforms sampling in both docking evaluation and benchmark metrics for assessing compound quality. Availability and implementation: The source code is available at https://github.com/boun-tabi/biochemical-lms-for-drug-design and the materials are archived in Zenodo at https://doi.org/10.5281/zenodo.6832145. Comment: 12 pages, to appear in Bioinformatics

    Exploring Data-Driven Chemical SMILES Tokenization Approaches to Identify Key Protein-Ligand Binding Moieties

    Machine learning models have found numerous successful applications in computational drug discovery. A large body of these models represents molecules as sequences since molecular sequences are easily available, simple, and informative. The sequence-based models often segment molecular sequences into pieces called chemical words (analogous to the words that make up sentences in human languages) and then apply advanced natural language processing techniques for tasks such as de novo drug design, property prediction, and binding affinity prediction. However, the chemical characteristics and significance of these building blocks, chemical words, remain unexplored. This study aims to investigate the chemical vocabularies generated by popular subword tokenization algorithms, namely Byte Pair Encoding (BPE), WordPiece, and Unigram, and identify key chemical words associated with protein-ligand binding. To this end, we build a language-inspired pipeline that treats high affinity ligands of protein targets as documents and selects key chemical words making up those ligands based on tf-idf weighting. Further, we conduct case studies on a number of protein families to analyze the impact of key chemical words on binding. Through our analysis, we find that these key chemical words are specific to protein targets and correspond to known pharmacophores and functional groups. Our findings will help shed light on the chemistry captured by the chemical words, and by machine learning models for drug discovery at large. Comment: 16 pages, 11 figures, new computational analysis and extended case studies
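    The tf-idf selection step described in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the target names, the toy chemical words, and the function `tfidf_key_words` are hypothetical, and the plain `tf * log(N / df)` weighting is an assumption, since the abstract does not state which tf-idf variant is used. Each "document" is the pooled list of chemical words from one target's high-affinity ligands.

```python
import math
from collections import Counter


def tfidf_key_words(documents, top_k=3):
    """Rank chemical words per target by tf-idf.

    documents: dict mapping a target name to the list of chemical words
    (tokens) pooled from that target's high-affinity ligands.
    Returns a dict mapping each target to its top_k words.
    """
    n_docs = len(documents)

    # Document frequency: in how many targets' ligand sets a word occurs.
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))

    ranked = {}
    for target, tokens in documents.items():
        term_freq = Counter(tokens)
        total = len(tokens)
        # Assumed weighting: relative term frequency times log(N / df).
        scores = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in term_freq.items()
        }
        # Sort by descending score; break ties alphabetically.
        ranked[target] = sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
    return ranked


# Hypothetical toy data: three targets with made-up chemical words.
docs = {
    "target_A": ["c1ccccc1", "C(=O)N", "c1ccncc1", "C(=O)N", "c1ccncc1"],
    "target_B": ["c1ccccc1", "C(=O)O", "S(=O)(=O)N", "C(=O)O"],
    "target_C": ["c1ccccc1", "CCN", "CCN", "C(=O)N"],
}
key_words = tfidf_key_words(docs, top_k=2)
for target, words in key_words.items():
    print(target, words)
```

    In this toy data the benzene-ring word `c1ccccc1` occurs for every target, so its idf is zero and it is never selected. Words concentrated in a single target's ligands rank highest, which is the sense in which tf-idf picks out target-specific chemical words.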

    Neuroinflammation, Energy and Sphingolipid Metabolism Biomarkers Are Revealed by Metabolic Modeling of Autistic Brains

    Autism spectrum disorders (ASD) are a heterogeneous group of neurodevelopmental disorders generally characterized by repetitive behaviors and difficulties in communication and social behavior. Despite its heterogeneous nature, several metabolic dysregulations are prevalent in individuals with ASD. This work aims to understand ASD brain metabolism by constructing an ASD-specific prefrontal cortex genome-scale metabolic model (GEM) using transcriptomics data to decipher novel neuroinflammatory biomarkers. The healthy and ASD-specific models are compared via uniform sampling to identify ASD-exclusive metabolic features. Notably, the results of our simulations and those found in the literature are comparable, supporting the accuracy of our reconstructed ASD model. We identified that several oxidative stress, mitochondrial dysfunction, and inflammatory markers are elevated in ASD. While oxidative phosphorylation fluxes were similar for healthy and ASD-specific models, and the fluxes through the pathway were nearly undisturbed, the tricarboxylic acid (TCA) fluxes indicated disruptions in the pathway. Similarly, the secretions of mitochondrial dysfunction markers such as pyruvate are found to be higher, as well as the activities of oxidative stress marker enzymes like alanine and aspartate aminotransferases (ALT and AST) and glutathione-disulfide reductase (GSR). We also detected abnormalities in sphingolipid metabolism, which has been implicated in many inflammatory and immune processes, but its relationship with ASD has not been thoroughly explored in the existing literature. We suggest that important sphingolipid metabolites, such as sphingosine-1-phosphate (S1P), ceramide, and glucosylceramide, may be promising biomarkers for the diagnosis of ASD and provide an opportunity for the adoption of early intervention for young children.

    Identification of Therapeutic Targets for Medulloblastoma by Tissue-Specific Genome-Scale Metabolic Model

    Medulloblastoma (MB), occurring in the cerebellum, is the most common childhood brain tumor. Because conventional methods reduce quality of life and endanger children with detrimental side effects, computer models are needed to imitate the characteristics of cancer cells and uncover effective therapeutic targets with minimum toxic effects on healthy cells. In this study, metabolic changes specific to MB were captured by the genome-scale metabolic brain model integrated with transcriptome data. To determine the roles of sphingolipid metabolism in proliferation and metastasis in the cancer cell, 79 reactions were incorporated into the MB model. The pathways employed by MB without a carbon source and the link between metastasis and the Warburg effect were examined in detail. To reveal therapeutic targets for MB, biomass-coupled reactions, the essential genes/gene products, and the antimetabolites, which might deplete the use of metabolites in cells by triggering competitive inhibition, were determined. As a result, interfering with the enzymes associated with fatty acid synthesis (FAs) and the mevalonate pathway in cholesterol synthesis, and suppressing cardiolipin production and tumor-supporting sphingolipid metabolites, might be effective therapeutic approaches for MB. Moreover, decreasing the activity of succinate synthesis and GABA-catalyzing enzymes concurrently might be a promising strategy for metastatic MB.

    A network-based approach on elucidating the multi-faceted nature of chronological aging in S. cerevisiae.

    BACKGROUND: Cellular mechanisms leading to aging and therefore increasing susceptibility to age-related diseases are a central topic of research since aging is the ultimate, yet not understood mechanism of the fate of a cell. Studies with model organisms have been conducted to elucidate these mechanisms, and chronological aging of yeast has been extensively used as a model for oxidative stress and aging of postmitotic tissues in higher eukaryotes. METHODOLOGY/PRINCIPAL FINDINGS: The chronological aging network of yeast was reconstructed by integrating protein-protein interaction data with gene ontology terms. The reconstructed network was then statistically "tuned" based on the betweenness centrality values of the nodes to compensate for the computer automated method. Both the originally reconstructed and tuned networks were subjected to topological and modular analyses. Finally, an ultimate "heart" network was obtained via pooling the step specific key proteins, which resulted from the decomposition of the linear paths depicting several signaling routes in the tuned network. CONCLUSIONS/SIGNIFICANCE: The reconstructed networks are of scale-free and hierarchical nature, following a power law model with γ = 1.49. The results of modular and topological analyses verified that the tuning method was successful. The significantly enriched gene ontology terms of the modular analysis also confirmed that the multifactorial nature of chronological aging was captured by the tuned network. The interplay between various signaling pathways such as TOR, Akt/PKB and cAMP/Protein kinase A was summarized in the "heart" network originated from linear path analysis. The deletion of four genes, TCB3, SNA3, PST2 and YGR130C, was found to increase the chronological life span of yeast. The reconstructed networks can also give insight about the effect of other cellular machineries on chronological aging by targeting different signaling pathways in the linear path analysis, along with unraveling of novel proteins playing part in these pathways.
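    The abstract above relies on betweenness centrality but does not say how the values were computed or how the statistical tuning was carried out. The sketch below only illustrates the centrality measure itself, using Brandes' algorithm on an unweighted, undirected graph. The node names `P1` to `P5` and the toy edges are hypothetical placeholders, not interactions from the yeast network.

```python
from collections import deque


def betweenness_centrality(adjacency):
    """Unnormalized betweenness centrality (Brandes' algorithm).

    adjacency: dict mapping each node to the set of its neighbours in an
    unweighted, undirected graph.
    """
    centrality = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Breadth-first search from source, counting shortest paths.
        visit_order = []
        predecessors = {node: [] for node in adjacency}
        n_paths = {node: 0 for node in adjacency}
        distance = {node: -1 for node in adjacency}
        n_paths[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            visit_order.append(v)
            for w in adjacency[v]:
                if distance[w] < 0:  # first time w is reached
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:  # v lies on a shortest path to w
                    n_paths[w] += n_paths[v]
                    predecessors[w].append(v)
        # Accumulate pair dependencies, farthest nodes first.
        dependency = {node: 0.0 for node in adjacency}
        while visit_order:
            w = visit_order.pop()
            for v in predecessors[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1.0 + dependency[w])
            if w != source:
                centrality[w] += dependency[w]
    # Each unordered pair was counted from both endpoints.
    return {node: value / 2.0 for node, value in centrality.items()}


# Hypothetical toy network: a linear chain P1 - P2 - P3 - P4 - P5.
edges = [("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P4", "P5")]
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

scores = betweenness_centrality(graph)
for node in sorted(scores):
    print(node, scores[node])
```

    In the chain, the middle node lies on the most shortest paths and the end nodes on none, which matches the intuition that high-betweenness proteins act as bottlenecks between signaling routes.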

    Circadian clock crosstalks with autism

    Background: The mechanism underlying autism spectrum disorder (ASD) remains incompletely understood, but researchers have identified over a thousand genes involved in complex interactions within the brain, nervous, and immune systems, particularly during the mechanism of brain development. Various contributory environmental effects including circadian rhythm have also been studied in ASD. Thus, capturing the global picture of the ASD‐clock network in combined form is critical. Methods: We reconstructed the protein–protein interaction network of ASD and circadian rhythm to understand the connection between autism and the circadian clock. A graph theoretical study is undertaken to evaluate whether the network attributes are biologically realistic. The gene ontology enrichment analyses provide information about the most important biological processes. Results: This study takes a fresh look at metabolic mechanisms and the identification of potential key proteins/pathways (ribosome biogenesis, oxidative stress, insulin/IGF pathway, Wnt pathway, and mTOR pathway), as well as the effects of specific conditions (such as maternal stress or disruption of circadian rhythm) on the development of ASD due to environmental factors. Conclusion: Understanding the relationship between circadian rhythm and ASD provides insight into the involvement of these essential pathways in the pathogenesis/etiology of ASD, as well as potential early intervention options and chronotherapeutic strategies for treating or preventing the neurodevelopmental disorder.